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Abstract 

We introduce a new Markov chain Monte Carlo (MCMC) sampler called the 
Markov Interacting Importance Sampler (MIIS). The MIIS sampler uses condi¬ 
tional importance sampling (IS) approximations to jointly sample the current state 
of the Markov Chain and estimate conditional expectations, possibly by incorporat¬ 
ing a full range of variance reduction techniques. We compute Rao-Blackwellized 
estimates based on the conditional expectations to construct control variates for 
estimating expectations under the target distribution. The control variates are par¬ 
ticularly efficient when there are substantial correlations between the variables in 
the target distribution, a challenging setting for MCMC. An important motivating 
application of MIIS occurs when the exact Gibbs sampler is not available because it 
is infeasible to directly simulate from the conditional distributions. In this case the 
MIIS method can be more efficient than a Metropolis-within-Gibbs approach. We 
also introduce the MIIS random walk algorithm, designed to accelerate convergence 
and improve upon the computational efficiency of standard random walk samplers. 
Simulated and empirical illustrations for Bayesian analysis show that the method 
significantly reduces the variance of Monte Garlo estimates compared to standard 
MCMG approaches, at equivalent implementation and computational effort. 

Keywords: Bayesian inference; Control variate; Mixed Logit; PMCMC; Markov 
Modulated Poisson Process; Rao-Blackwellization; Variance reduction. 


1 


1 Introduction 


This paper introduces Markov interacting importance samplers (MIIS), a general Markov 
Chain Monte Carlo (MCMC) algorithm that iterates by sampling the current state from 
a conditional importance sampling approximation to a target distribution. An impor¬ 
tance sampling (IS) approximation consists of a set of weighted samples from a proposal 
distribution that approximates the target. Markov interacting importance samplers are 
conditional in the sense that the importance distribution may depend on the previous 
state of the Markov chain. The marginal distribution of the states converges to the target 
distribution for any number of importance samples at each iteration of the Markov chain; 
the algorithm does not induce an approximation error. 

We adopt importance sampling as a basic tool from the perspective that it can be 
more efficient than a Metropolis-Hastings sampler based on an identical proposal. Im¬ 
portance sampling naturally incorporates the information from all generated samples, 
while standard Metropolis-Hastings estimates lose information from rejected draws. In 
addition, importance sampling estimates are based on independent samples and as a 
consequence the method is immediately amenable to a range of variance reduction tech¬ 
niques (such as antithetic sampling and stratihed mixture sampling), as well as convenient 
to implement and parallelize. It is not standard practice in applied work to incorporate 
these features into Metropolis-Hastings approaches as they ar e more challenging to desig n 
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Importance sampling can be efficient when we are able to construct n umerically accu¬ 


rate a nc 

fenOTIl . 


computationally fa s t app roxim ations to a full t arget distribution. iRichard and Zhang 


Hoogerheide et al.l (120121) and 


Li et af 


(120131) are recent contributions in this 
area that have l ed to the a pplication of IS t o challenging problems: see for example 


Liesenfeld et ah 


(120131 ) and 


Tran et ah 


( 20141) . We motivate MIIS by observing that 


even if the joint target density is intractable by global approximation, we can frequently 
obtain efficient importance samplers for the conditional distributions. MCMC methods 
provide a natural way of handling large dimensional problems by sampling from con¬ 
ditional distributions (Gibbs sampling) or by generating samples from complex target 
densities through local exploration. The MIIS algorithm leverages the advantages of 
importance sampling in this setting. 

As a leading application, we consider the case in which it is not possible to imple¬ 
ment an exact Gibbs sampler due to infeasibility of direct simulation from the conditional 
distributions. The MIIS method relies on IS approximations of the conditional distribu- 


2 




































tions to sample the current state of the Markov Chain. The advantage of importance 
sampling is that we can additionally use the approximation (that is, all the generated 
samples) to estimate conditional expectations, possibly by incorporating the full range 
of variance reduction methods available for standard importance sampling. We compute 
Rao-Blackwellized estimates based on the conditional expectations to construct control 
variates for estimating expectations under the target distribution. The control variates 
are particularly effective when there are substantial correlations between the variables 
in the target distribution. This is a challenging setting for standard MCMC approaches 
because the conditioning scheme may imply strong serial correlation in the Markov chain. 

We introduce the general MIIS algorithm and present four examples that demonstrate 
its flexibility. The hrst two examples present the implementation of MIIS based on 
simple importance sampling targeting the full and conditional distributions. We derive 
conditions for the ergodicity and uniform ergodicity of the sampler. The third example 
introduces antithetic variables and is also uniformly ergodic under general conditions. 
The hnal example introduces the MIIS random walk algorithm, designed to accelerate 
convergence and improve upon the computational efficiency of standard random walk 
samplers. The random walk sampler is uniformly ergodic assuming that the importance 
weights are bounded. Ergodicity holds under milder constraints. 

Our method relates to the Particle G ibbs (PG) algorithm developed for Bayesian in¬ 


ference in general state space models by 


Andrieu et ah 


1 20101) . The PG algorithm itera¬ 


tively draws the latent state trajectories from its high-dimensional smoothing distribution 
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11201 4 a nd 
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Lindsten and Schon 
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Lindsten et al. 

12ni4bll 
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1I2018D and 


112014ri present ext ensions, while iGhopin and Singh 
study the theoretical aspects of 
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the algorithm. We can show that the particle Gibbs algorithm is a particular type of MIIS. 
Gompared to PG, the MIIS algorithm addresses a wider class of sampling problems and 
the use of variance reduction methods. 

We illustrate Markov interacting importance samplers in a range of examples. We 
consider the estimation of the posterior m e an for a Bayesian Mixed Logit model using 
the health dataset studied bv iFiebig et al.l 1120101) . The presence of unobserved hetero¬ 
geneous preferences in this discrete choice model motivates the use of MGMG methods 
that iteratively sample the model parameters and the latent choice attribute weights 
conditional on each other. The results show that the MIIS algorithm with control vari¬ 
ates increases efficiency in mean squared error by a factor of four to twenty compared 
to the Metropolis-within-Gibbs algorithm, which is a standard tool for problems that 
are not amenable to exact Gibbs sampling. We also implement the MIIS random walk 
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importance sampler for carrying out posterior in ference for Markov modu l ated P oisson 


processes, a problem considered for example by 


Fearnbead and Sherlock 


J2nn6li . 


Our 


analysis reveals four to hundredfold gains in efficiency over the st andard random walk 


Metropolis algorithm and the multiple-try Metropolis algorithm of 


Liu et al. 


fennnh . in 


this context, the improvements are mainly due to parallelization and better convergence 
of the Markov chain. 


2 Markov Interacting Importance Samplers 

To focus on the main ideas, we use densities in our mathematical discussion up to Section 
El We assume that the densities are defined with respect to measures that we leave 
unspecified for now. We provide a more precise treatment in Section [7] and the appendix. 

2.1 Notation and basic definitions 

This subsection presents some of the notation used in the article. We define the basic 
random variables on a set A that is a subset of Euclidean space. Suppose that f(x) is 
a real function with x G A. We take any density i/(x) on A to be with respect to some 
measure on A, which we denote as dx. We define the expected value of / with respect 
to the density ly as 

Kif) ■= j f{x)Ty{x)dx (1) 

provided the integral exists. 

In our article, 7r(x) is the target density. We often can evaluate 7i{x) only up to a 
constant of proportionality m{x), with 7i{x) = m{x)/Zm, where = J^m{x)dx is the 
normalizing constant. Suppose that Xi G A,i = 1,... ,N. Then, for we 

define i:j := {i, i -h 1,... ,j}, x^j := (x*,.. .,Xj) and x\k := (xi,.. .,Xk-i,Xk+i,.. .,Xn)- 

2.2 Conditional Importance Sampler 

This section introduces the conditional importance sampler (CIS) which is the basic build¬ 
ing block of the MCMC algorithms in this article. The CIS is motivated by the question: 
“how to implement an importance sampler approximation to vr that provides unbiased 
samples?"A\ie CIS is our solution to this problem. We go beyond simple importance 
sampler and construct a general framework that not only covers the simple importance 
sampling approximation with variance reduction techniques, but also extends the basic 
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importance sampling paradigm, allowing local exploration of the target inside an MCMC 
setting, for instance, by using a random-walk approach. 

At each iterate of an MCMC algorithm, the CIS constructs an empirical approxi¬ 
mation to the target density vr(-). It generates an auxiliary variable ^ and N particles 
Ai-tv conditional on the previous iterate y, in such a way that one particle is gen¬ 
erated through a Markov transition kernel and the other iV — 1 particles are generated 
conditional on Xk- 

We now present a more precise description of the CIS. Let ri{^\y) be the conditional 
density of the auxiliary variable with ^,y G A, and take ri{^) = f '>](^ly)7r(y)dy so 
that vr(j/|,^) = 'y(^ly)7r(y)/7](^). Let T{y,x;^) be the density of a Markov transition 
kernel from y to x G A, conditional on that is reversible with respect to 7r(|/|^); i.e., 
T^iy\OTiy,x;^) = TT{x\^)T{x,y;^), or equivalently, 

T^iy)v{^\y)T{y, x; 0 = 7r(x)rj(^jx)T(x, y; 0- (2) 

Given ^ E A, let q(xi: 7 v|.^) be a joint importance distribution with marginals qi{xi\^) 
[i = 1,. .., N). For any 1 < A; < iV, dehne the conditional density 

(l\k{x\k\xk,0 ■= 

Definition 1 (Conditional Importance Sampler). For any given y E A and 1 < k < N, 
the Conditional Importance Sampler generates Xi.^,^\{y,k) from the probability distri¬ 
bution 

T^{xi..n, ^\y, k) := y{i\y)T{y, x^, f) c[\k{x\k\xk, 0- (4) 

The auxiliary variable rj introduces dependence in the importance sampling approxi¬ 
mation. Moreover, we can often choose the auxiliary density rj so that Wi{x-,f) is bounded. 
For instance, the random-walk importance sampling algorithm chooses yi^lx) = q{x\f) = 
0 ( 1 ^ — x|). The weights are Wi{x;^) = m{x), which are bounded if m{x) is bounded. 
The dependence on ^ can be easily dropped if one takes ri{-\y) = r]{-) and each = 

qi{-)- The Markov transition kernel T{y,-]f) can be taken as the identity kernel, i.e., 
T{y, sO = ^(' ~ y)j which is our choice in Sections [3] and El A Metropolis-Hastings 
kernel targeting 7r(-|.^) is also a valid choice. 

The CIS generates {Xi,n,^) using the following algorithm. 

Algorithm 1 (Conditional Importance Sampler). Given {y,k), 

1. sample ^^T]{^\y); 
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2. sample Xk ~ T{y,Xk;^); i.e., generate the particle Xk using the Markov kernel. 

3. sample X\k ~ (l\k{x\k\xk,0> generate all the remaining particles conditional 
on ^ and the propagated particle Xk ■ 


From the output of the Conditional Importance Sampler we dehne the weights for 
i = 


Wi{xi.,N-,0 


Wi{xi;^) 


where Wi{x-,^) 


m{x) 

qi{x\0 


r]{^\x) 


( 5 ) 


and let ■= hFi(a;i:Ar, .^)),... ,{xn, hFjv(a^i:iv, 0)} be the empirical approximation 
to 71. The weights depend on the marginals gi(-|0 (* = of q(xi:Ar|^), the 

auxiliary distribution and the target distribution 7r(-) oc m(-). Based on we 

dehne the estimator of as 


N 

Eas(f) ■= E W'iUi.K.O/ni) = E,gjf). (6) 

Dehne the joint density 


7r^{k,y,Xi.,N,0 ■= N ^7r{y)T^{xi.,N,^\y,k). 


( 7 ) 


Lemma [T] gives some fundamental properties of n^{k,y,xi.,N,^) and shows that the ex¬ 
pectation of E^jg{f) is Et.^/) if the marginal distribution 7i^{y,k) = N~^7i{y). We use 
E^ig^f), additively, within an MCMC scheme to construct unbiased estimators of E^^l^f). 
The unbiasedness property is critical for the variance reduction techniques in Section O 

Theorem 1. Suppose that E^^dfl) is finite, {k,y) is a sample from N~^7i{y), and that 
{xi:N,0 generated from T^{xi.,N,^\y,k). Then, 

(i) 71^{y) = 7T{y). 

(ii) 


N 

n^{k,y\xi,N,0 = = i)T{xi,y] f), ( 8 ) 

i=l 

or eguivalently. 


71^ {K = i\xi.,N,i) = Wi{xi.,N,i) and 7T^{y\xi.,N^,k) = T{xk,y;0- ( 9 ) 
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(tit) =-E,(/). 

Remark 1. We now compare importance sampling to conditional importance sampling. 
In importance sampling, we draw particles xi.,n from an importance or proposal density 
q(a^i:Af) with marginal densities qi{xi) and calculate their importance weights 

\ Wi{xi) 

Wi[xi.,N) ■= —]v-^—T, where Wi[xi) 

to obtain the approximation rtfg := {Wi:]\f{xi:jtf), xi^^} to n. The IS sampling estimate of 

EM) 

N 

E^'sif) ■■= E (10) 

i=l 

In the simplest case, the particles a;i: 7 v o^re sampled independently from the same proposal 
distribution q, i.e., qi = ■ ■ ■ = qN = q and q(a;i:Ar) = Despite similarities, 

there fundamental differences between using fcis ^fs- 

1. The marginal distribution of a sample X from nfg is not 7i{X), while the distribution 
ofY fromnQjg isn(Y). Similarly, 

EfEP(f))ytE,(f), (11) 

whereas = Ei{f). 

2. The weights W) in the CIS may depend on an auxiliary variable f,, with density 
rif\y), that incorporates past information in the proposal opening the possibility for 
using local proposals. Moreover, it can be used as a mechanism to bound the weights 
and provide more robust estimators. 

2.3 Markov Interacting Importance Sampling Algorithm 

The Mils algorithm simulates from the target distribution tt on A. It iterates by hrst 
constructing a discrete approximation to tt using the CIS, conditional on the previous 
state {y, k) of the Markov Chain, and then samples from the approximation. It requires 
specifying a joint proposal distribution q(a;i:Ar; .^), an auxiliary distribution ri{f,\y)j a 
Markov transition kernel T{y,x-,f). 

Algorithm 2 (Markov Interacting Importance Sampler). Given y^^^ E A and 1 < < 

N, at step t = 1,2,... 


qMi) ’ 
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1. Generate ~!?/*■* ^^)- 

Generate ~ T ■ 

3. Generate 








For k = 1,..., N, calculate 


, W A 


(i)^ 




f^i;K'‘>|4‘>), and W't(4‘4?''’) 


i"* (4‘’i5) 


EL™i(4';« 


.w. 


Draw yjith probability W^fc(a^i*^Ar, 

5. Generate ^ ^(t)^ ^ 

We divide the algorithm into two blocks. The hrst block consists of steps 1 to 3 and 
uses the CIS to draw an approximation to vr. It corresponds to Algorithm [T] in Section [2]2j 
The second block consists of steps 4 and 5 and draws an element from this approximation. 
It corresponds to part (ii) of Theorem [TJ 

The Mils algorithm is a Gibbs sampler on an augmented space that contains all 
variables sampled in the CIS step, i.e., it is a Gibbs sampler targeting ([7]). It also follows 
that if ~ A^“^7r(-), the marginal distribution of is the original target vr; 

the Mils algorithm generates samples from vr without the approximation error induced 
by the CIS step. 

Theorem 2 (Target Distribution). The Markov Interacting Importance Sampler is a 
Gibbs sampler targeting the augmented density o that has niy) as a marginal density. 


3 Examples 

This section illustrates the MIIS methodology in three useful examples. For simplicity, 
the Markov transition density is set to the identity density, i.e., T{y,x;^) = Sy{x), which 
denotes a density inx that integrates to 1 and which is zero exact a.t x = y; we will 
sometimes write it as d{x — y). We do not use the auxiliary variable ^ in the first two 
examples, which is equivalent to assuming that ri{^\x) = ri{^) and q(iri:Ar|,^) = q(xi:Ar). 
Section \T72\ gives formal convergence results for all three examples. 








3.1 Simple Importance Sampling 


This spec ification corresponds to the iterated Sampling Importance Resampling algorithm 
(i-SIR) in lAndrieu et all (120131 ) . In importance sampling algorithms we generate particles 
independently from importance distributions qi{x) = q{x) {i = i.e., Xi-^n ~ 

11^=1 Hence q(a;i:Ar|0 = Ilf=iq{xi) and 


N 

(l\kix\k\xk,k,^) = 

i^k 


The CIS in this case is 

N 

T^{xi.,N^\y, k) = - Xk) JJ q{xi). 

i^k 

Algorithm [3] follows from Algorithm [21 

Algorithm 3 (MIIS with Simple Importance Sampling). Given and = k, 

1. Generate xf"* ~ q{x), for i = {l-.N}\k, and set 

2. Draw = k\xf^^ with probability proportional to Wk{x^k) = m{xf^)/q{x^l'^). 

3. Set y^^'> = x^^ly 

3.2 Importance Sampling with Antithetic Variables 

In the importance sampling literature, the method of antithetic variables consists of 
drawing perfectly negatively correlated particles to reduce the variance of the Monte 
Carlo estimate. We can use this method within the MIIS framework. The importance 
sampler with antithetic variables draws the particles in pairs from a proposal distribution. 
Suppose that N is even. For k < N/2, let qk{xk) be the density of Xk with corresponding 
cumulative distribution function Qk{-) and let XM/ 2 +k = Qk^{^ ~ Qk{xk)), where is 
the inverse of Qk- We write the joint density of Xk,XN/ 2 +k as 

qk,N/2+k{Xk,XN/2+k) = Qk{Xk) 5 Q-i(^i_Q^(^^^)){xN/ 2 +k)- 

The marginals are qk{x) = qN/ 2 +k{x) and the conditional density of Xk given XN/ 2 +k is 
qk{xk\xN/2+k) = For notational simplicity assume k < N/2. We 
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sample the particle system given [xk, k) from 

N/2 

q.\k{x\k\xk,^,k) = (5Q-i(i_Qj^(^^))(a;Ar/2+fc) ]^gi,Ar/2+i(a^i,a^Ar/2+i) 

i^k 

_ Yl^l 1i,N/2+iiXi, XN/2+i) 
qk{xk) 

_ q(a^i:jv) 
qk{xk) ’ 

and the CIS is 

■r-rA^/2 / \ 

qk[xk) 

Algorithm 4 (MIIS with Antithetic Variables). Given |/h-i) and = k, 

1. Generate {xi^\XN/ 2 +i) ~ qi,N/ 2 +iixi,XN/ 2 +i), for i = {l:iV/2} \ k. 

2. If k < N/2, set xf^ = and XN/ 2 +k = Qk^{I - Qk{xf^)). If k > N/2, set 

^k^ = and Xk-N/2 = Q//-N/2O- - Qk-N/2{xf)). 

3. Draw = k\xf^^ with probability proportional to mfx^l'^)/qkfx^^'^). 

4. Sety^^^ = xfa)- 

3.3 Random Walk Importance Sampler 

The random walk importance sampler draws particles from a symmetric proposal depen¬ 
dent on its past. The advantage is that the method bonnds the weights by construction. 
The random walk proposal performs local exploration around the auxiliary variable 
which we sample conditionally on the previous state. 

Let q{-\y) = y{-\y) = </{■ — y) denote the proposal functions for g* and rj. Then 

N 

q.\kix\k\xk, A;, 0 = n ~ 

i^k 

The CIS is 

N 

r^(a;i:Ar, ^\y, k) = 6y{xk)4>{f - Xk) JJ 4>{xi - 0- 

i^k 

The random walk importance sampler bounds the weights if m{x) is bounded. The 
sampling algorithm follows from Algorithm [2] 
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Algorithm 5 (MIIS with Random Walk proposal). Given and = k, 

1. Generate ~ 

2. Generate ~ (j){x — for i = {1:-/V} \ k, and set a;[*^ = 

3. Draw = k\x^^^^ with probability proportional to m{x^l'^). 

4. Set y^^^ = x^^ly 


4 MIIS Targeting Conditional Distributions 

This section shows how to nse nse the MIIS algorithm within a Gibbs sampling framework. 
We nse the following notation. Snppose we partition a; G A as {a;(l), ... ,x{d)}. Then, 
for 1 < s < t < d, x{s:t) := {x{s),x{s + 1),... ,x{t)},Xi{s:t) := {a;i(s), ... ,Xi{t)}, etc. 
We dehne Ag := {a:(s):a; G A} and A \5 := {x(\s):x G A}. For a density z/(a:), x E A, 
we dehne the conditional density h's{x{s)\x{\s)) := z/(x)/i/(a;(\s)) and the conditional 
expectation 


-^i^s(R(\s)) (/) 



/(a;)z/ 5 (a;(s)|a;(\s))da;(s). 


( 12 ) 


4.1 Conditional Importance Sampler for conditional distribu¬ 
tions 


The CIS for conditional distribntions is similar to the CIS in Section 12.21 bnt now 
targets 7is{x{s)\x{\s)), s = l,...,d. Given y E A, s E {l:d} and kg E let 

Vs{^{s)\y{s),y(\s)) be the density of the anxiliary variable ^(s), conditional on y. Let 
Tg{y{s),Xk,{s);^{s),y{\s)) be a the density of a Markov transition kernel, conditional on 
(^(s), 2 /(\s)), that is reversible with respect to 7r^(2/(s)|^(s), 2 /(\s)) oc TTg{y{s)\y{\s))T]g{^{s)\y{s),y{\s)). 

Given ^(s) and y(\s), let qs(xi:Ar(s)|^(s),?/(\s)) be a joint importance density with 
marginals qs,i{^i{s)\^{s),y{\s)) {i = 1,... ,N), and 




q.(a^i:v(g)|^(g),l/(\g)) 

g.,fc«(a^fe«(s)|^(s),|/(\s))' 


(13) 


Definition 2 (Conditional Importance Sampler for conditional distribntions:). For 1 < 
s < d, y E A, and kg E {1:A}, the Conditional Importance Sampler for conditional 
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distributions generates Xi:]\f{s),^{s)\{y{s),ks,y{\s)) from the probability distribution 


rf (xi:iv(s), ^(s)||/(s), ks, y(\s)) 


Vs{.i{.s)\y{s),y{\s)) T,(|/(s), a;fc,(s); ^(s), ?/(\s)) 

X qsAfc,(a^\fc,(s)kfc.(s),/cs,^(s),2/(\s)). (14) 


In the CIS for conditional densities, we first generate ^(s), then we generate Xk^{s) 
conditional on ^(s), and finally the remaining particles x\k^{s) conditional ^(s) and Xk^(s) 
Suppose we express the target 7rs(a;(s)|a;(\s)) oc ms(x(s)lx(\s)), where we can evaluate 
ms(x(s)lx(\s)). From the output of the CIS for conditional distributions, we dehne the 
weights 




w.,i(xi(g);^(s)||/(\s)) 


(15) 


where 


w^.,i(a^i(s);'^(s)l2/(\s)) 


ms(xi(s)jy(\s)) 

Qs,i(xi(s)j^(s),y(\s)) 


^.(^(s)ki(s),2/(\s)) 


(16) 


and consider ^^cis('iy(\^)) ■— ^i('5))) ■ ■ ■, (hFs.v, a;Af('S))} as an empirical approx¬ 
imation of 7rs(-jy(\s)). Based on dehne the estimator of F^^^(.|j^(\s))(/) as 


N 




2=1 


Ws,i(Xi 


(s);^(s),l/(\s))/(a;i(s),|/(\s)) = (17) 


Analogously to the CIS, dehne the joint density of (Kg, V(s), Xi.jy(s), f(s) conditional 
on V(\s) as 

(^s, 2 /(s),a:i:jv(s),^(s)||/(\s)) := (a;i,7v(s), ^(s)|?/(s), A;,, 2 /(\s)). (18) 

Lemma [3] gives some properties of the density ffTSjl and shows that if (kg, y(\s)) is gener¬ 
ated from N~^7rg(y(s))jy(\s)) then the expectation of E^Qig{f) is i77rs(-|j/(\s))(/)- 

Theorem 3. Suppose {ks,y{s)) be a sample from N~^7ig{y{s))\y(\s)), and {xi:n is),as)) 
a sample from Tf {xi:N{s),f{s)\y{s), kg, y{\s)). Then, conditional ony{\s), 

(i) n^{y{s)) = Tig{y{s)). 

(a) The conditional density of kg,y{s) given Xi,N{,s),f{s) is 


Trf {ks, y{s)\xi,N{s),^{s)) = Wg^ksT{xg^{s),y{s)-, f{s)) 
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or equivalently 


vrf {ks\xi-,N{s),^{s)) = Ws,k, and 
i.yi.s)\xi.,N{,s),i{s),ks) = Ts{xkXs),y{.s)]i{s)). 

(Hi) (/). 

4.2 The Markov Interacting Importance Sampler within Gibbs 

The algorithm extends the MIIS sampler targeting the full density. It simulates sequen¬ 
tially from the conditional distributions 7ri(j/(l)|j/(\l)),..., Hfi{y{d)\y{\d)), using the CIS 
approximation to the conditionals. The method is an alternative to the Metropolis- 
within-Gibbs algorithm that is is suitable for the application of the variance reduc¬ 
tion techniques in Section O The MIIS within Gibbs sampler requires the specihca- 
tion of joint proposal distributions {qs(xi:Ar(s)|^(s), ?/(s), |/(\s)}, auxiliary distributions 
{ys{i{s)\y{s),y{\s))}, and Markov transition kernels {Ts{y{s),XkXs)]^{s),y{\s))}, for 
each s = 1,..., d. The general form of the MIIS Gibbs sampler is given by Algorithm [B] 


Algorithm 6 (The Markov Interacting Importance Sampler within Gibbbs). Given 
y^^^ G A and 1 < kf''^ < N, s = 1,... ,d, the algorithm at step t = 1,2,, is de¬ 
scribed as follows, with all terms conditional on 1) and y^^~^\s-\-l\d). 


1. For s = 1,... ,d, 


1.1. Generate ^ rjs{f^{s)\y^*' ^^(s)). 

1.2. Generate 



1.3. Generate 


V- 


(s) ~ q. 


\Af- 





1 


1.4 


conditional on a:*-*(s), fci* (s), y^^^ (\s). 

kg 

Draw = k\{xf^j^{s) ,with probability proportional to 





rris 


qs,k 
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1.5. Generate 


2. Set y^^'> = ..., y^^\d))'. 

For each partition s = 1,..., d, the algorithm iterates as in the MIIS algorithm. Steps 
1.1 ~ 1.3 construct an approximation ^s('l2/(\'S))- Steps 1.4 and 1.5 then draw 

an element from this approximation. As before, the MIIS for conditional distributions 
is a Gibbs sampler on an augmented space that contains all variables sampled in the 
CIS step. It also follows that the marginal distribution of y^^^ is the original target vr. 
Theorem 0] shows the augmented target distribution and that it generates samples from 

71. 

Theorem 4 (Target Distribution). The Markov Interacting Importance Sampler is a 
Gibbs sampler targeting the augmented distribution given by 

^^{y, C, a;i:iv(l), • • •, xi:N{d),ki:d) = Yl {xi-.N{s),^{s)\y{s), ks, y{\s)), (19) 

and has N~^7i{y) as a marginal distribution of {ki.d^y). 

4.3 Example: MIIS within Gibbs with Simple Importance Sam¬ 
pling 

The MIIS sampler takes the conditional distributions in the Gibbs sampler as the target 
distributions for the conditional importance samplers. Suppose that we use a simple 
importance sampling algorithm to construct the CIS approximation. Then, for each 
s = 1,..., d. 


N 

{xi.,N{s),f{s)\y{s),ks,y{\s)) = r]{f{s))5{y{s) - Xk{s)) JJ gs,i(xi(s)), 

for proposal distributions qs,i{xi{s)) = qs{xi{s)). 

The distribution of the marginal sequence generated by this algorithm converges to 
the full target vr as the number of iterations increases under suitable regularity conditions 
that are given in Section [71 

Next algorithm follows from Algorithm [6l Corollary 0] in Section 17.31 gives formal 
convergence result for Algorithm [71 


14 





Algorithm 7 (MIIS for Gibbs Sampler with Simple Importance Sampling). Given 
and kfl, 

1. for s = 1,... ,d 

(a) Generate ~ qs{xi{s)), fori = {l:N}\ks, and set = 

(b) Draw = k\{xf^j^{s),y^^'>{\s)) with probability proportional to the weight 

{ms{xt\s)\y^^\\s))/qs{xf{s)). 

(c) Update y^^\s) = x^%{s). 

ks 

2. t = t+1 

5 Estimation of expectations using variance reduc¬ 
tion methods 

Variance rednction techniques play a central role in Monte Carlo integration. We can di¬ 
rectly embed variance reduction methods such as antithetic sampling into the conditional 
importance sampling approximation. This section takes a step further and considers vari¬ 
ance reduction methods based on the output of the MIIS algorithm. Suppose that the 
algorithm targeting vr runs for M iterations. The simplest estimator of ET^{f), which uses 
only the output from the Markov Chain, is 

1 ^ 

Emc(/)^=T7E/C) =%!„(/) (20) 

t=l 

where = {(1/^) ..., (1/iV, x*^^^)}. 

We can improve efficiency by reusing all the particles, constructing Rao-Blackwellized 
estimators, and using control variates. Section 17.41 shows that all the estimators in this 
section are consistent under ergodicity. We assume throughout this section that the 
chain has reached the stationary distribution before running M iterations of the algo¬ 
rithm. In this case the estimators are also unbiased. In the practical situation where the 
initialization is arbitrary, the estimators are asymptotically unbiased in M for a fixed N. 
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5.1 Reusing all the particles 

The Mils algorithm constructs an unbiased approximation 


N 

E^rsAf) ■= E f (T*) (21) 

i=l 

to at each iteration t of the Markov chain, after the chain has converged. The 

Mils estimator that averages over the terms 

Eil)ls(f ) M E ^c.sAf) = -Emc {E^isAf)) ( 22 ) 

^ t=l 


5.2 Rao-Blackwellization 

The motivation for Rao-Blackwellized estimators is that the variance of f{x{s)) is larger 
than the variance of T^^^(.|a;(\s)(/). However, the latter requires knowledge of the con¬ 
ditional expectation in closed form. The MIIS for the Gibbs sampler overcomes this 
limitation by using an unbiased approximation of the unknown conditional expectation. 
It follows from Theorem [3] that, at each iteration t of the Markov chain, the term E^(jjg{f) 
is an unbiased estimator of i? 7 rs(-|x(\s)(/)- For each s = 1,... ,d, dehne 


^M,N 

^s,RB 


M 


(A)-i7}_EAc,sAf) 

^ t=l 


(23) 


where 


rpN 

^s,CIS,t^ 


N 

if) = E'^*.' (^i‘!v(s);5f'k“>(\s)) / (T) 

i=l 


(24) 


and xf^ = {xj(s), and x^^\\s) = : s—1), : d)}. 

We define the Rao-Blackwellized MIIS estimator for the Gibbs sampler as the average 
of the marginal Rao-Blackwellized estimators in fl23|) . 


pM,N 

^MIIS 


if) 


1 

d 




S=1 


(25) 


Both the marginal Rao-Blackwellized MIIS estimators E^^if) and the Rao-Blackwellized 
MIIS estimator for the Gibbs sampler E^’j^g{f) are unbiased estimators of Ej^if) and 
converge to E^^i^f) with probability one as M ^ cx), for any N > 2. 
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5.3 Control Variates 


It is optimal to further combine the simple Monte Carlo estimator and the MIIS estimator. 
For j = 1,... ,p, suppose that gj{x) is an integrable function with respect to the density 
TT and U{gj) a real function such that E:^N(U(gj)) = 0. Let k = (ki, ..., Kp) be a pxl 
vector of parameters and let F = f—Y7j=i '^jU{gj)- For an optimal choice of n, we would 
like the variance of the estimate of the posterior mean of F to be smaller than that of /. 
The variables U{gi) are the con trol variates. The M o nte C arlo estimator using F in place 
of / is studied in many settings; iRobert and Casellal (1200411 and lLinI (1200111 . among others, 
discuss the standard case. Control variates are not commonly used in an MCMC setting 
because the Markov sampling scheme makes it more difficult to hnd suitable candidate 
control variates with mean zero. 

Define E^j^^^gj) similarly to (l?I]l and 


C(fi'j) gjix^^^)—EcjsA9j) 


(26) 


Assuming ergodicity, the samples from the MIIS Markov chain are eventually dis¬ 
tributed as and n^[Ut{gi)] = 0 as required. The estimator with control variates 

is 


pM,N 

Ecv 




M 




t=i 

M 


i=i 

p 




t=l 


i=i 


rpM 

^MC 


i=i 


= Ef,rAF). 


(27) 


An alternative compact notation shows how we combine the previous estimators, 


^M,N, 
^CV 


(/; «) = E^c{f)-y. S- E^c{9^)-E^MiU9,) 


p 


i=i 


nM 


pM.AT 


(28) 


In a simple case we may have for example p = 1 and gi{x) = f{x), which allows us to take 
advantage of the typically high correlations between the simple MC and MIIS estimators 
of EM). 

The optimal choice of coefficients k, (in the sense of minimizing the variance of the 
estimator) solves the problem of projecting E^^{f) on 'Mj=i ^jEMcif^igj))- solution 
is K* = Mu^uf, where Mu = E{EMiU)xEMiUy) and Mf = E{EMiU)xEMif)), 
where the expectations are with respect to all the random variables generated by a MIIS 
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Markov Chain with M iterations. In onr appli cations we estima t e the covariances by 
nsing the overlapping batch means method as in iFlegal and JonesI (1201 ll) . 

We can also nse control variates in a Gibbs sampler setting. Onr es t imato r generalizes 
the control variates approach nsed bv iDellaportas and Kontovianni.sl (120121) . which only 
applies to exact Gibbs samplers. For a fnnction / and fnnctions Qsj that are integrable 
with respect to vr, 


^M,N 

'^s,CV 


if-K) ■.= E^cU)- 


d Ps 

s=i j=i 


K 






(29) 


We estimate the optimal parameter k, = {ki^, ..., ^ 2 ^ 1 , ■ ■ ■, ^ 2 ,^ 2 ; • • • > ^d,i, ■ ■ ■, 

as above. 


6 Illustrations 


6.1 Gibbs sampler with importance sampling 

6.1.1 Sampling from a bivariate normal distribution 


In this example we sample from a simple bivariate normal distribntion to compare the 
performance of the MIIS sampler with control variates to the Metropolis-within-Gibbs 
fMwG) sampler in a setting in which the exact Gibbs sampler is available as a reference. 
Dcllaportas and Kontoviannis ( 20121) adopt this example to illustrate their use of control 


variates for the Gibbs sampler. The purpose of this example is to show, in a simple 
setting, that the MIIS sampler with control variates performs well relative to the MwG 
and G ibbs samplers. We also present result s for the Gibbs sampler with control variates 
as in ( Dellaportas and Kontoviannis! . 2OI2I) . which we regard as the ‘gold standard’for 
this problem. Beyond this example, we make the important point that the MIIS and 
MwG samplers do not require being able to sample from exact conditional distributions, 
whereas it is necessary to sample from the exact conditional distributions for the Gibbs 
sampler. All the methods are very simple to implement for this example. The target 
distribution is 

n{x) oc exp 



E = 

1 

P 

V 2 


_ P 

1 


where p G {0.25, 0.5, 0.99} represent low, moderate and high correlation. 

We are interested in MGMG estimators of the mean, variance, covariance, a tail 
probability of the marginal distribution of a;(l), i.e., E.„{X{1)), Et,{X{1)‘^)—E.„{X{1))‘^, 
E^{X{1)X{2))-E^{X{1))E^{X{2)), E^(/[X(1) < -2.32]) = Pr(X(l) < -2.32), 
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We implement the MIIS algorithm of Section H73] (Algorithm 6). We separately con¬ 
sider the standard case and the use of antithetic variables as in Section 13.21 (Algorithm 
4). The importance distribution qs,i{xs,i) for the MIIS method is a Student t with 5 de¬ 
grees of freedom, shifted and rescaled to have the same mean and variance as the target 
conditional distribution 7 is{x{s)\x{\s)). We use the same proposal for the MwG sampler. 
The number of particles in the IS approximation is A^ = 50. To make the Gibbs and 
MwG algorithms comparable to MIIS, in these methods we sample 50 iterates of W(l) 
(X(2)) conditional on the current state of X{2) (X(l)) in the chain. 

We use control variates of MIIS as in Section 15.31 The estimator is given by (1251) . 
where we consider at least two control variates for each moment estimate 


M 


M N 




t=l 


t=l i=l 


with 1/2 = expressed similarly. The control variates are 

the differences between the standard MGMG estimates and the corresponding Rao Black- 
wellized MIIS estimates. We consider additional control variates for estimating the tail 
probability and ii^^(X(l)X(2)). For the tail probability, we include the same control 
variates used for mean estimation. For estimating Ej^(X(l)X(2)), we incorporate the 
control variates used for estimating the rn e an an d variance. We apply the overlapping 
batch means method in iFlegal and Jones! (120111 ) to estimate the covariance matrix of 


the standard estimator (120]) and the control variates based on the output of each chain. 
That allows us to estimate the optimal coefficients for the control variates as described 
in Section 15.31 

Table H] summarizes the results. We report the estimated mean square error (MSE) 
relative to the MwG sampler based on 500 independent Markov Ghains with 10,000 
iterations (after a burn-in period of 1,000 iterations) . The results reveal that when 
the correlation in the target bivariate normal distribution is pronounced (p = 0.99), the 
MIIS method with control variates improves the MSEs for estimating the mean, variance, 
and covariance by 98-99% compared to the MwG sampler. The control variates efficiently 
explore the information in the chain and the high correlation between the two variables to 
reduce variance. The results for the covariance estimators show that the MIIS approach 
can work well when estimating expectations which involve variables in different blocks 
of the sampler. Introducing antithetic variables in the conditional importance sampler 
leads to a 99.8% reduction in MSE compared to MwG. Despite the high correlation in the 
target distribution, the MIIS estimator with antithetic variables takes advantage of the 
fact that the mean of the proposal is the exact conditional mean. As p becomes lower, 
the MIIS-GV method displays a lower but still large reduction in MSE in comparison 
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to MwG. The table also shows that as in 


Dellaportas and Kontoviannisl (1201211 . the use 


of control variates in the Gibbs sampler is highly efficient. The disadvantage with the 
Gibbs-GV method is the requirement that the Gibbs sampler is feasible in the first place, 
whereas the MIIS-GV applies generally. This simulation exercise illustrates that in many 
situations, accurate estimation of conditional expectations using MIIS will translate into 
accurate estimation of expectations under the target distribution with the use of control 
variates. 


Table 1: Bivariate Gaussian simulation - Monte Garlo MSE of target density expectation 
estimates relative to MwG. 



Gibbs 

Gibbs-GV 

p = 0.99 

MwG MIIS-GV 

MIIS/A-GV 

Mean 

1.087 

0.002 

1.000 

0.011 

0.002 

Variance 

0.805 

0.001 

1.000 

0.011 

0.001 

Govariance 

0.789 

0.001 

1.000 

0.022 

0.002 

P{X{1) < -2.32) 

0.942 

0.746 

1.000 

0.966 

0.874 


Gibbs 

Gibbs-GV 

p = 0.5 

MwG MIIS-GV 

MIIS/A-GV 

Mean 

0.931 

0.000 

1.000 

0.025 

0.000 

Variance 

0.974 

0.000 

1.000 

0.177 

0.225 

Govariance 

0.988 

0.000 

1.000 

0.066 

0.022 

F(X(1) < -2.32) 

0.906 

0.148 

1.000 

0.270 

0.240 


Gibbs 

Gibbs-GV 

p = 0.25 

MwG MIIS-GV 

MIIS/A-GV 

Mean 

0.944 

0.000 

1.000 

0.073 

0.000 

Variance 

0.830 

0.000 

1.000 

0.493 

0.850 

Govariance 

0.973 

0.000 

1.000 

0.167 

0.025 

P(X(1) < -2.32) 

0.810 

0.020 

1.000 

0.179 

0.179 


6.1.2 Mixed Logit Model 

We consider posterior simulation for the Mixed Logit (MIXL) model as a substantive 
applied example where it is necessary to apply a method such as importance sampling 
within Gibbs or Metropolis-within-Gibbs. The binary Mixed Logit model specifies the 
probability that an individual chooses a certain alternative j = 1 (over j = 0) at occasion 
f as ^ 

p{i chooses j = 1 at t\Zit,/3i) = (^ 3 q) 

l+exp(/3o.+Ef=ite.O 
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where 6i = {Poi, Pu,..., f^Li)' is the vector of utility weights for individual i and Zu = 
{ziit ,..., zlu)' is the corresponding vector of attributes for the choice. The individual 
specific constants are (3oi = (3o+Voi with rjoi ~ N(0, a^) and the attribute weights for each 
individual are latent variables with specihcation 




i = h 


(31) 


with r/ii ~ N(0, erf). 

The parameter vector is 0 = (/do, ctq, /di,..., /3£, af,..., crl)', while the vector of la¬ 
tent variables for each individual is (i = (/do*,... ,/dLi). The Mixed Logit model captures 
heterogeneity in preferences by allowing individuals to weight the choice attributes differ¬ 
ently. By introducing taste heterogeneity, the MIXL specification avoids the restrictive 
indepe ndence of irre l evant alternatives (HA) property of the standard multinomial logit 


model (IFiebig et ah 
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3 ). 


We consider an empirical appli cation to the Pap smear data set used for simulated 


maximum likelihood estimation in 


Fiebig et al.l (120 lOh . In this data set, / = 79 women 


choose whether or not to have a Pap smear test on T = 32 choice scenarios. We let 
the observed choice for individual i at occasion t be yu = 1 if the woman chooses to 
take the test and ya = 0 otherwise. Table [2] lists the choice attributes and the associated 
coefficients. We impose the restriction that ct| = 0 in our illustrations since we have found 
no evidence of heterogeneity for this attribute. To simplify the computational algorithm 
for this example given this restriction, we fix at the maximum likelihood estimate. 


Table 2: Choice attributes for the pap smear data set 


Choice attributes 

Values 

Associated parameters 

Alternative specific constant for test 

1 

Yo, CTo 

Whether patient knows doctor 

0 (no), 1 (yes) 


Whether doctor is male 

0 (no), 1 (yes) 

A,cr2 

Whether test is due 

0 (no), 1 (yes) 

/^3,0-3 

Whether doctor recommends test 

0 (no), 1 (yes) 

Yai CTa 

Test cost 

{0, 10, 20, 30}/10 

/Ss 


We specify the priors as (3^ ~ N(0,100), an oc (l-t-ag)”^, A ~ N(0,100), ai oc 
(l-|-aA~^, for / = 1,..., L. We follow iGelmanl (120061) and impose half-Cauchy priors on 
the standard deviation parameters. 

In the general notation of the paper, we want to simulate the posterior distribution 

ofx = {O'■ ■ ■ XiY■ 
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6.1.3 Results 


We focus on the estimation of the posterior mean of the model parameters, that is 
E^{/3o), E^{ao), E^{/3i), -E^(/34), E^{ai), E„{a4). 


We implement MIIS and Metropolis-within-Gibbs algorithms that iteratively sam¬ 
ple the parameters (a;(l) = 6) and the choice attributes for all individuals {x{2) = 
conditional on each other. Equation flMjl implies that conditional on /3ii 
for all i and Z = 0,1,..., 4, the posterior of 6 factorises into five components with Gaus¬ 
sian conditional likelihoods from which we can independently sample the corresponding 
mean and standard deviation parameters. As before, the number of importance sam¬ 
ples for the MIIS method is = 50. We generate 50 iterates of x{s) conditional of the 
previous value of x{\s) in the MwG algorithm to make the two approaches comparable. 
The proposal for the i ndividual choice attr i butes combines the efficient importance sam- 
pl ing fEIS) rnethod of iRichard and Zhand (120071) with the defensive sampling approach 
of IHesterberd (119951) . The importance density is the two component defensive mixture 


qiCilyn, • • •, Vix) = ..., 2/iT) + (l-w)p(0), 


where ... ^yir) i s a multivariate G aussian importance density obtained using 

the EIS method. Following IHesterberd (119951) . the inclusion of the state prior p(0) in the 
mixture ensures that the importance weights are bounded. We set the mixture weight as 
u = 0.5. We also use the EIS method to obtain the importance parameters for the hve 
bivariate parameter proposals (the conditional maximum likelihood estimates are easy 
to implement alternatives which we use to initialise the EIS method) and incorporate 
antithetic variables throughout. 

We consider the same set of twenty control variates for each MIIS estimate. The hrst 
set of control variates are based on the parameters 6, 


E^^c{0,)-EZi]s{0,). forj = l,...10. 


nM,Ar 


These control variables are the differences between the standard MGMG posterior mean 
estimates and the MIIS Rao-Blackwellised estimates. We additionally use two types of 
control variates based on the individual choice attributes. The first group of control 
variates based on the attributes is 


E = 0,..., 4 


2=1 
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and the second is 


E E^mcWI,)-E^',}%(&). = 0,,,,, 4, 

i=l 

The motivation for this second set of control variates is that the parameters of the model 
are the means and variances of the individnal choice attribntes, see equation fl3ip . Since 
there are I individuals, we construct the control variates by averaging the posterior 
moment estimates of f3ki- Because of the correlation between the parameters (a:(l)) and 
the choice attributes (a;(2)) in the Markov chain, we expect these control variates to be 
highly correlated with the posterior mean estimates of the parameters. Moreover, the 
use of all twenty control variates simultaneously allows us to leverage the high posterior 
correlations for variance reduction. We estimate the optimal control variate coefficients 
as in the last section. 

Table [3] reports the estimated MSE for each method relative to MwG. The results 
are based on 500 independent Markov Chains with 20,000 iterations after 1,000 burn-in 
draws. The MIIS column in the table corresponds to the Rao-Blackwellized estimate 
given by fl2^ . We initialize every chain at the maximum likelihood estimate 
and approximate the “true” posterior means by averaging all the 500 MwG and MIIS 
estimates (without control variables). The results show that the benehts of using the 
MIIS Rao-BIackwellized estimates by themselves may be small or negligible because the 
autocorrelation in the MIIS chain is the main determinant of the total variance of the 
estimates in this example. When we use the Rao-Blackwellized estimates to construct 
the control variates, we obtain 75-95% reductions in MSE relative to the MwG algorithm. 
The two methods have similar computational cost and implementation effort. 


Table 3: Mixed Logit Application - Monte Carlo MSE of posterior mean estimates relative 
to MwG. _ 


Parameter 

MwG 

MIIS 

MIIS-CV 

Po 

1.00 

0.91 

0.07 

A 

1.00 

1.23 

0.06 

A 

1.00 

0.92 

0.05 

A 

1.00 

0.98 

0.06 

A 

1.00 

0.66 

0.08 

O-Q 

1.00 

0.95 

0.07 

o-i 

1.00 

1.02 

0.16 

0-2 

1.00 

0.94 

0.08 

0-3 

1.00 

1.17 

0.08 

(74 

1.00 

0.54 

0.25 
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6.2 Random Walk Importance Sampler 

6.2.1 Markov Modulated Poisson Process 


A Markov Modulated Poisson Process (MMPP) Yt is a Poisson process whose intensity 
\t takes on a discrete number d of values -0 = (f/’i, • • • with the intensity at any 

time point determined by the state of an unobserved continuous-time Markov chain with 
gene rator Q. We identify the model by imposing the parameter restriction > ■ ■ ■ > 


^pl. 


Sherlock et al. 


(120101) recently considered the MMPP as a challenging case study 


for comparing a range of Random Walk Metropolis (RWM) algorithms proposed in the 
literature. We replicate their setting to illustrate how the random walk importance 
sampler of Section 13.31 can lead to more efficient and robust MCMC simulation compared 
to standard RW samplers. 

Suppose that we observe a realisation of the pro cess over a certain time window and 
record n event times. Fearnhead and SherlockI ( 2006 ) derived the likelihood for the model 


as 




g^p(Q ^ exp(*^ 1, 


(32) 


where 

Q = ) , 

V q2i -q2i J 

V is the initial distribution of the latent state Zt (which we take to be the stationary 
distribution of the chain implied by Q), i.e., Pr(Zt = j) = z/(j), = diag('0), 6 is a vector 

of ones, ti is the time from the start of the observation window until the hrst event, ti is 
the time between events i—1 and i, and tn+i is time between event n and the end of the 
observation window. 


6.2.2 Simulation Study 

We replicate the simulation study in ISherlock et ahl (j2010l) . We simulate the MMPP 
model with d = 2 over an observation window of 100 seconds. The generat or matrix Q has 


Sherlock et al. 


parameters qi 2 = q '21 = 1. The intensity vector is ip = (10,17)'. As in the 
(l2010l) application, we complete the model by specifying exponential priors for all the 
parameters. The means of the priors are the true parameters. 

We consider t hree different me thods: the standard RWM algorithm, the multiple-try 
RWM (MTM) of Liu et al. (2000), and the MIIS random walk method (Algorithm 7). 
We consider a random walk on the transformed parameter vector 6 = (log('0i), log(V’ 2 ~ 
-^i), log(gi 2 ), log(g 2 i), which is more efficient than working on the original scale. Let i 
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index the current iteration of the Markov Chain. The proposal for all methods is 




where S is an estimate of the posterior covariance matrix of 6 based on a trial run of the 
RWM algorithm. The scale of the proposal aims to achieve an acceptance rate of 0.234 
in the standard RWM algorithm, wh ich is optimal rate under certain assumptions; see 


the discussion inISherlock et ah 


(120101) . We consider four control variates associated with 


each parameter for estimating each posterior mean 


forj = l,...4. 


As before, the control variates are the differences between the standard MCMC estimates 
and the Rao-Blackwellized estimates that reuse all the particles. 

We parallelize the likelihood evaluations over eight cores at every iteration of the MIIS 
Markov chain and set the number of particles to = 8 and = 16. Our discussion treats 
the MIIS method with N = 8 draws as being comparable to the standard RWM method, 
which is difficult to parallelize. This implies that the MIIS algorithm performs eight times 
as many likelihood evaluations as the standard RWM algorithm in total, but in the same 
amount of time under perfect parallelization. We report the actual computing times in 
the tables. We conhgure the MTM method such that it performs the same number of 
likelihood evaluations per iteration as the MIIS algorithm with N = 8 particles. However, 
the parallelization for the MTM method is less efficient as every iteration of the method 
requires two separate stages. 

The simulation study averages results over ten independent realisations of the DGP. 
For every realisation, we simulated 500 independent Markov chains for each method 
and ran each Markov Chain for 10,000 iterations after discarding a burn-in of 1,000 
iterations. We consider two cases for initialisation. We initialize the algorithm at the 
maximum likelihood estimate for half the chains. For the other half, we initialize the 
chain by drawing from the prior. We use the same draw from the prior to initialize all 
the methods at each replication. Initializing from the prior allows us to compare the 
convergence performance of each method. We then compute the posterior mean and 
variance estimates based on each chain. We combine all chains initialized at the true 
parameters to obtain precise approximations to the true posterior means and variances. 

Table S] reports the MSE efficiency of the posterior estimates relative to the per¬ 
formance of the RWM method. We average the results over the four parameters for 
conciseness. We also present the actual computing times, and average acceptance rate 
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and average integrated autocorrelation time (lACT). We base the last two results on 
longer independent chains with 100,000 iterations (one per simulated dataset). In the 
case of Mils, the “acceptance” rate is the proportion of iterations in which the sampled 
particle is not the previous iterate. We also report the relative time adjusted MSE for 
each method, which we dehne as (timeaigXmsea;g)/(timeMHXniseMr/)- This estimate ap¬ 
proximates the MSE relative to the MH algorithm for the same amount of computing 
time. We average all these results over realisations. 

The results show that the MIIS method reduces the time adjusted MSEs by 70% com¬ 
pared to MH when we initialize the chain at the true parameters. This gain in performance 
comes both from reductions in lACT and the use of Rao-Blackwellization to estimate the 
posterior moments. The MIIS method also outperforms the MTM method, at a lower 
computational cost. Using control variates further reduces the MSE, leading to a 78-83 
time adjusted improvement over MH. To put these gains in the c ontext of the randoin 

fern m 


Sherlock et ah 


walk literature, the best performing algorithm in the simulation of 
for the same DGP, a MH random walk in the log scale with with an adaptively tuned 
mixture proposal, leads to a 84% reduction in variance (measured by lACT) compared 
the least efficient algorithm in their analysis, a MH random walk with tuned proposal 
A^(0, A^J). The table also shows that when we initialize all algorithms from the prior, the 
MIIS algorithm with N = 8 particles and control variates generates 80-99% reductions 
in time adjusted MSE compared to the standard RWM algorithm. This result suggests 
that the MIIS algorithm is more robust to the initial conditions than the standard RWM 
and MTM algorithms. 


6.2.3 Empirical Example 

We now apply the RWM, MTM a. nd MIIS methods using data from the empirical example 


m 


Fearnhead and SherlockI (120061) . The data consists of positions (in bases) of Chi sites (a 


DNA motif) in the genome of Escherichia coli bacteria. The specification of the MMPP 
m odel is the same as in the simul ation study above. We follow the procedure described 

to obtain data-based parameters for the exponential 


m 


Fearnhead and Sherlock 


fl 2006 lj 


priors. We estimate the MMPP for the lagging part of the outer ring of the E. coli 
genome strand, which has 117 observations in total. We ran each Markov Chain for 
50,000 iterations after discarding a burn-in of 1,000 iterations and initialized all chains 
at the maximum likelihood estimate. We also use the Hessian of the likelihood at the 
maximum likelihood estimate to obtain the shape of the random walk proposal. 

Table |5] displays the Monte Carlo MSEs over 500 replications of each algorithm. The 
results show that the MHS-CV algorithm with 77 = 16 has 83—90% lower MSEs than the 
standard RWM algorithm. Adjusting for the actual computational times, the improve- 
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Table 4: MMPP DGP with ijji = 10 and '^2 = 17. Monte Carlo MSE of posterior 
estimates relative to MH (average across parameters). 

We define the time adjusted MSE as (timeaigXmseaig)/(timeMrrXinseMrr)) which approximates the 
MSE relative to the MH algorithm for the same amount of computing time. 





MHS 

MHS-CV 


MH 

MTM N=8 

N=16 

N=8 

N=16 

Acceptance 

0.30 

0.46 

0.32 

0.47 

0.32 

0.47 

lACT 

15.0 

9.3 

7.5 

5.6 

7.5 

5.6 

Time 

8 

15 

9 

14 

9 

14 


Initializing at the true parameters 

Mean 

1.00 

0.47 

0.24 

0.16 

0.15 

0.12 

Variance 

1.00 

0.67 

0.25 

0.14 

0.20 

0.15 




Time adjusted MSE 


Mean 

1.00 

0.88 

0.27 

0.29 

0.17 

0.21 

Variance 

1.00 

1.26 

0.29 

0.25 

0.22 

0.27 



Initializing from the prior 


Mean 

1.00 

0.07 

0.06 

0.02 

0.01 

0.01 

Variance 

1.00 

0.24 

0.19 

0.06 

0.05 

0.04 




Time adjusted MSE 


Mean 

1.00 

0.13 

0.07 

0.03 

0.01 

0.01 

Variance 

1.00 

0.46 

0.21 

0.10 

0.06 

0.06 
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ments are between 44—66%. The MIIS algorithm also outperforms the MTM method. 
We note from the table that the practical computational cost of adding particles tends 
to be low, so that we can consider a higher N to increase robustness. 


Table 5: Empirical example for the MMPP model - Monte Carlo MSE of posterior 
estimates relative to MH. 

We define the time adjusted MSE as (timeaigXmseaig)/(timeMrrXinseMrr)) which approximates the 
MSE relative to the MH algorithm for the same amount of computing time. 



MH 

MTM 

MIIS 

N=8 N=I6 

MHS-CV 
N=8 N=16 

Acceptance 

0.24 

0.54 

0.42 

0.57 

0.42 

0.57 

lACT 

55.2 

20.2 

13.3 

9.4 

13.3 

9.4 

Time 

20 

62 

55 

59 

58 

63 



Posterior mean 


bi 

1.00 

0.45 

0.25 

0.17 

0.18 

0.13 

b2 

1.00 

0.50 

0.21 

0.15 

0.15 

0.10 

912 

1.00 

0.39 

0.20 

0.14 

0.16 

0.10 

921 

1.00 

0.52 

0.27 

0.15 

0.15 

0.12 



Time adjusted MSE 


bi 

1.00 

1.39 

0.68 

0.50 

0.51 

0.43 

b2 

1.00 

1.55 

0.59 

0.45 

0.45 

0.33 

9l2 

1.00 

1.21 

0.54 

0.42 

0.46 

0.33 

921 

1.00 

1.60 

0.73 

0.45 

0.44 

0.37 



Posterior variance 


bi 

1.00 

0.53 

0.34 

0.19 

0.22 

0.17 

b2 

1.00 

0.65 

0.29 

0.21 

0.18 

0.11 

9l2 

1.00 

0.45 

0.20 

0.12 

0.17 

0.12 

921 

1.00 

0.50 

0.23 

0.14 

0.15 

0.10 



Time adjusted MSE 


bi 

1.00 

1.66 

0.93 

0.56 

0.64 

0.56 

b2 

1.00 

2.02 

0.81 

0.63 

0.51 

0.36 

9l2 

1.00 

1.39 

0.55 

0.35 

0.50 

0.38 

921 

1.00 

1.55 

0.63 

0.41 

0.45 

0.33 


7 Theory 

This section presents our theoretical results for the MIIS estimators and restates some of 
the dehnitions in previous sections in a more general setting. 

Let (%, f2) denote a measurable space and n some given target probability distribution 
on (%, f2). Assume that a reference measure /i dominates tt (tt /t) and that 7r(da:) = 
n{x)fi{dx). With a small abuse of notation, we write n{x) for the density of the probability 



measure vr with respect to fi. In most situations A C 3?'^ (h G N, the set of positive 


integers), = B{A) is the Borel cx-algebra of the set A, and the majorizing measure fi 
is either the counting measure, the Lebesgue measure, or a combination of both. We 
assume that the distributions ri{d^\y) and gfc(da;|.^) {k = 1,..., N) admit densities ri{^\y) 
and qk{x\^) with respect to the same measure y. We work interchangeably with other 
distributions and their corresponding densities. Let A = and = fli®-• 

In the conditional case, for all s = l,...,(i, let denote a measurable spaces. 

We assume that 7 rs(d|/(s)||/(\s)), qs,k{<^y{s)%,y{\s), and ris{<^is\y{s),y{\s)) are dehned 
on and have densities with respect to some majorizing measure that may 

depend on y{\s) G for = xf^^Ai. 

7.1 Convergence of the marginal MIIS chain 

If (i/, k) is marginally distributed as iV“^ 7 r, then the CIS estimator is unbiased by Theo¬ 
rem, [T](iii). Theorem [5] below shows that the MIIS Algorithm (Algorithm [ 2 ] in Section 1273]) 
samples from the target density asymptotically, i.e., as the number of iterations 

t —)• oo. In other words, the marginal distribution of {y^^\ k^^^) is A^“^ 7 r, asymptotically. 
For all l,k E {1W} and y, z E A, dehne 



(33) 


where (ii,k{xi,Xk\^) is the joint marginal of {xi,Xk) for I ^ k. 

The proof of Theorem Elis based on the following assumption discussed in Section [7721 

Assumption 1. (i) There exists a constant C, 0 < C < oo, such that the marginal 

densities qk{x\^) satisfy 7i{x)ri{f\x) < C qk{x\f), for each k and all x,f E A. 

(a) (a) For each k,l E {IW} and y,z E A, there exist functions hk^i{y,z) such that 


Si,k{^,y,z)y{f\y)7]{f\z)y{df) > hi^k(,y,z). 


A 


(b) For each I E {IW} there exists a set Ji C {1:A}\{/} such that: JiAJk 7 ^ 0 
for I 7 ^ k; and 

(c) for all j G ffi and y,z E A hij{y, z) > 0 and hj^i{y, ^) > 0. 

Assumption [H (i) requires the weights to be uniformly bounded and it is often used in 
the particle literature. This condition is not restrictive and can be enforced by choosing 
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suitable qk and rj. Part (ii) is a technical condition that imposes regularity conditions on 
the pairwise dependence of the particles, on the kernel T, and the auxiliary distribution 
rj. 

Theorem 5. (i) If Assumption{l\ holds then the marginal chain , sampled 

using MIIS, is Markov and ergodic, i.e., 


lim \ P\l,y;-)-N i7r(-)|Ty = 0, 

t^OO 


where P{y^l] Bx{kl\) is the Markov transition kernel from {y,l) to Bx{k}, B 
k e 


(ii) If, in addition, for k G Ji, hi^k{y,z) > > 0, i.e., h^^ does not depend on the 

initial value y E A, then the marginal chain is uniformly ergodic. 

The distribution of the marginal chain {{y^^\ j converges to the target distribution 
A^“^7r as the number of iterations increases. It means that, after a warm up period, 
the marginal distribution of samples from the chain is and, hence, Eff(j{f) is an 

unbiased estirn ator of E^^^f) for any integrable /. If i?^(|/|) < oo, then by Theorem 3 of 
TiernevI (jl994j) . ergodicity implies that E)f^{f) is also a consistent estimator of E.,. ( f). 
If EM") < oo and uniform ergodicity holds then by Theorem 5 of iTiernevI 019941) we 
also obtain a central limit theorem for E^^^f). 


7.2 Convergence results for the examples in Section [3] 

This section discusses the application of Theorem O to the examples in Section [31 In all 
three examples T is the identity kernel, i.e., T{y, dz;f,) = 6y{dz). This gives Si^M, y, z) = 
^.hk{,y,z\i)/ql{y\i)qk{z\i) ioi k ^ I and SiM^y^z) = I{z = y)/qi{y\f)- Hence, we require 
hi,k{y, z) > 0 functions such that 

[ vi^\y)vi^\z)-Y'^^Y^^j^yidf) > hiM^z), for Mk 

Ja (ii{y\0Qk{z\0 

Tt ^ f v{^\y)v{^\z) ^ ^ , 

Iiz = y)j^^^^^ym>hM,z). 

Part (i) is assumed explicitly and we choose to satisfy Assumption [Tlii). 

Simple importance sampling example 

This example is discussed in Section 13.11 
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Corollary 1. Suppose that there is no dependence ofT and the qi on ^ and (i) T{x, dy) = 
6x{dy), (a) q{dxi.,pf) = ^ Cqi{dxi), where C > 0 is a 

positive constant. Then, the marginal chain {{y^^\ is uniformly ergodic for N > 3. 

Let ? 7 (d.^||/) = 5 o(d 05 without loss of generality. It is easy to see that hi^k{y, z) = I {I ^ 
k) is a valid choice. Assumption [T] is satished by taking Ji = and N > 3. 

Uniform ergodicity follows from Theorem [S] part (ii), because hk,i{y,-) does not depend 
on y for I E Jk- 

Importance sampling with antithetic variables 

This example is discussed in Section 13.21 

Corollary 2. Suppose there is no dependence on f and (i) T{x,dy) = S^^dy) = 6x{y)dy, 

(iij q(^dxi.j\f') = qi,i+N/ 2 {dxi, dxi^]\f/ 2 ') such that qi^]\j/ 2 {dxi^]\f/ 2 \xi) = 

where Qi is the cdf of qi{xi). (Hi) nidxi) < Cqiidxi), where C > d is a positive constant 

Then, the marginal chain {{y^^\ ^5 uniformly ergodic for N/2 > 3. 

It is straightforward to check that, 

q»(».»-) k€{l:N}nll-N/2.N/2+l}.k^l 

qi{y)qk{z) k e {1 :N}\{i,i-n/2,n/2+i} 

Choose Ji = {l-.N}\{l,l—N/2,N/2+l}. It is easy to see that hi^k = I{k G Ji) is a valid 
choice. Assumption [1] is satished and the MIIS sampler is uniformly ergodic using the 
same arguments as in the previous example. 

Random walk importance sampler 

This example is discussed in Section 13.31 

Corollary 3. Suppose that (i) T{x, d|/|0 = d^idy) = 6,,{y)dy, (ii) q(dxi:Ar|0 = <?i(da::i|0 

and (zu) gi(dxi|0 = (l)ixi-f)dxi; (iv) pidQy) = (j){f-y)df; (v) (t){x-y) > 0 for any 
x,y E A. (vi) vr(xj) < C. Then, the marginal chain {{y^^\ /cd))| is ergodic for N >3. If 
m.iz,y&A f </>(^—z)</>(^—y)df > e > 0 Then uniformly ergodic. 

For I 7 ^ k, Si^k{f,,y,z) = 1 because the proposals are independent, and 

h{y,z) := hk,iiy,z) = [ (j){f-y)(j){z-f)df > 0. 

Ja 

Choose fh = {1:A^}\{/}. Then Assumption [1] holds and ergodicity follows from Theorem 
O By assumption there exists e > 0 such that h{y, z) > e for all y,z E A. By dehning 
hikiz) = e for k E Ji, uniform ergodicity follows from part (ii) of Theorem [5l 
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7.3 The Mils Gibbs Sampler 

This section shows that the marginal chain generated by the MIIS Gibbs 

sampler (Algorithm E] in Section [4.2p is ergodic if (i) the ideal Gibbs sampler, i.e., the 
Gibbs sampler drawing variables from the conditionals 'n's{dy{s)\y{\s)), is irreducible and 
aperiodic; (ii) the GIS Gibbs sampler satisfies regularity conditions that are similar to 
Assumption dl but hold for each s = 1,... ,d; (hi) The space A is Euclidean with Lebesgue 
measure the underlying measure. 

Our notation assumes that we condition on y{\s) when dealing with the sth compo¬ 
nent and do not usually show this conditioning explicitly. The transition kernel for the 
ideal Gibbs sampler is 


PG{y,dz) := Jj7r^(d2:(s)|2:(l:s-l),?/(s-M:d)) 

i=l 

For all l,k E {1:A^} and ^{s),y{s), z{s) G As, define 


(34) 


Ss,i,k{^{s),y{s), z{s)) := / Ts{y{s), dxi{s)-,^{s))Ts{z{s), dxk{s)-,^{s)) 

Ja^ 


Ss,iA^{s),y{s),z{s)) ;= 


f Tsjyjs), dxi{s)-, ^{s))Ts{z{s),xi{s);^{s)) 


I = k 


(35) 


where qs,i,k{xi{s), Xk{s)\^{s)) is the joint marginal of {xi{s),Xk{s)) for I ^ k. 

The proof of Theorem [6] is based on the following assumption, which generalizes 
Assumption [1] to the Gibbs case. 


Assumption 2. The following condition holds for all s = 1,... ,d. All terms are condi¬ 
tional on y{\s), unless stated otherwise. 

(i) There exists a constant C, 0 < C < oo, such that the marginal densities qs^k{x{s)\f{s)) 
satisfy 7is{x{s))ri{f{s)\x{s)) < C^^'^qs, k{x{s)\f{s)), for each k and all x{s),f{s) G 
A,. 


(ii) For each k, I G y{s), z{s) G Ag and y{\s) G 

(a) There exist functions hs^k,i{y{s), z{s)) > 0 such that 



Ss,i,k{^{s),y{s), z{s)) r]{f{s)\y{s))r]{f{s)\z{s))y{df{s)) > hs,i,k{y{s), z{s)); 


(b) for each I G {1:A}, there exists a set Jg^i C Js^iAJs^k 7^ 0 fori ^ k; 
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(c) for each j e Js,i, hs^ij{z{s),y{s)) > 0 and hsj,i{z{s),y{s)) > 0 on y e {x e 
A : 7r{x) > 0} and z(s) G Ag. 

Define 1 := li-,d and k := ki:d- 

Theorem 6. Suppose Assumption\^ holds. If Pg is irreducible and aperiodic, then so is 
the marginal kernel Puiy, 1; d^xk), and for any starting values y E A with 7i{y) > 0 and 

1 G 

to |Pl,(j/,li.)-/V-'7r(.)|^j, = 0. 

Gibbs Sampler with simple importance sampling example 

We now consider the example of the Gibbs sampler with simple importance sampling 
discussed in Section 14.31 

Corollary 4. Suppose that there is no dependence on ^ and the following conditions 
hold for s = (i) Ts{x{s),dy{s)\y{\s)) = 4(^)(d|/(s)), (zi) qs{dyi:Nis)\y{\s)) = 

n*=i?s,*(%(-5)ll/(\s)), (Hi) There is aC > 0 such that qg^i{dyi{s)\y{\s)) > C^/’^'Kg{dyi{s)\y{\s)). 
If we further assume that the ideal Gibbs sampler, Pg, is irreducible and aperiodic, then 
the distribution of the marginal chain {\T> ,yh)^t > 1} converges to the full target 
as t ^ oo for any fixed N >3. 

This corollary follows after the same arguments used in the marginal case. The 
functions hg^i^k = I7^ k) and the sets = {l:iV}\{/} for each s = 1,... ,d. The result 
follows from Theorem [3l 

7.4 Consistent estimation of expectations 

Using all the particles 

The next theorem shows that the MIIS estimator E^'^jg{f) discussed in Section [5] con¬ 
verge to E^{f). 

Corollary 5. Let f : A M. be such that U^rd/I) < oo and suppose Assumption\J\ holds. 

Then the MIIS estimator E^'^jg{f) —)■ E^^^f) with probability one as M ^ oo, for any 
N>2. 

Using Rao Blackwellized estimators 

Dehne the Rao-Blackwellized estimators E^^{f) and E ^^(/) as in Section xxx. Then, 

Corollary 6. Let f : A M be such that U^(|/|) < oo. Suppose Assumption\^ holds. 

Then, the Rao-Blackwellized estimators E^^{f) and E ^^(/) converge to E^^^f) with 
probability 1 as M ^ oo. 
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Using Control Variates 


The following two results shows that the estimators based on control variates discussed 
in Section 15.31 are consistent under ergodicity. 

Corollary 7. Let f : A M. be such that U^(|/|) < oo. Suppose Assumption^ holds. 
Then the estimator using control variates —?• ET,{f) with probability one as 

M —)■ oo, for any n G 

Corollary 8. For any s = 1,... ,d, let f ■. A ^ be such that U^(|/|) < oo. Suppose 
Assumption\B holds. Then the estimator using control variates E^^y{f,9) —)■ Ej^^f) with 
probability one as M ^ oo and any n G 
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A Proofs 

The notation in this section is the same as in Section [71 


A.l Markov Interacting Importance Sampler 

Proof of TheoremUl Part (i): From ([7]), is a proper distribution function that in¬ 
tegrates to 1 and has marginal 7i^{dy) = n{dy). Part (ii): The joint distribution 
7r'^(dxi:Ar, d^, k) is 


7r^(dxi,7v,d^, A:) = / 7r"(dTi:Ar, d^, di/. A:) 


N/ 


= N 7r{dy)r]{df\x)T{y,dxk,0<J\k{dx\k\xk,0 


= N Tr{dxk)r]{df\xk)T{xk,dy,f)q\k{dx\k\xk,0 

Ja 

= N~^7i{dxk)r]{df\xk) q\kidx\k\xk, 0 
7r{xk)ri{df\xk) 


Nqk{xk\0 

Wk{Xk\0 


N f^m(x)ju(dx) 


q(dxi-,Njxk,0 
q(da;i,7v|0, 


The second line is the joint distribution, the third line follows from reversibility of the 
Markov kernel, the fourth line integrates out y, and the last line follows from the dehnition 
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of the weights. The conditional distribntion 


Tr^(K = k\xi:N,0 




WkjXklO 


Wk{Xi:N,0^ 


Similarly, 


7i^{dy\xi,N,^,k) 


^^{dxi:N,d^,dy,k) 

7f^(da;i:7v,d^, fc) 

N-^7r{dy)r]{d^\x)T{x, dxk, Q q\fc(dx\fc|fc, Q 
N-^7i{dxk)ri{d^\xk) q\k{dx\k\xk, 0 

7r{dxk)v{d^\xk) q_\k{dx\k\xk, Q 
Tj-(^(ixk)ri{d^\xk) ci\k{dx\k\xk, 0 
T{xk,dy;^). 


Part (hi): 

^^{k,dxk) = j TT^{k,dy,di,dxi,N) = N~^ j Tr{dy)r]{d^\y)T{y,dxk;0^\ki(ix\k\xk,0 

= N~^7i{dxk) J r]{d^\xk)T{xk,dy;^)q\k{dx\k\xk,0 = ■ 

Hence, 

N 

E^^ifiXK)) = '^i^^k)f{xk) = E^{f). 

k=i y 

Similarly, by hrst conditioning on Xi-j^ and we obtain 

e^mUXXk)) = 

N 

= E:k^{N-^Y. / f{Xk)Wk{Xi..N,OT{XkAy,i)) 

k=l y 

□ 

Proof of Theorem O The proof follows from Part (ii) of Theorem [T] and because 
T^(d^, dxi.,N\y, k) = P^(d^, dxi.,N\y, k) □ 
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A.2 Markov Interacting Importance Sampler for Conditional 

Distributions 

Proof of Theorem 0 The proof is analogous to the proof of Theorem [H with tt replaced 
by 7r,(-|x(\s)). □ 

Proof of Theorem We write the MIIS Gibbs sampler as a Gibbs sampler in an aug¬ 
mented space. Each step of the algorithm consists in sampling from the following col¬ 
lapsed Gibbs sampler. 

Algorithm 8. For s = 1,... ,d, 

(i) Sample Xi,N{s),f{s)\{y{s),ks,y{\s),f{\s), {xi,N{\s)),k\s) 
from Tf^{dxi.,N{s),df{s)\y{s),ks); and 

(n) Sample Y{s),K,\xi,N{s),i{s),{yi\s),i(\s)) from 

N 

'^Ws,i{xi,N{s)]f{s)))I{Ks = i)T{xi{s),dy{s)]f{s)). 

i=l 

To prove the theorem it is sufficient to show that the conditional density 

7f^(d|/(s), d^(s), ks, da:i:Ar(s)||/(\s), ^(\s), a;i: 7 v(\s)) 

gives the sth step in Algorithm [S] above. The proof uses the same arguments as those in 
Theorem |2l The joint distribution 

(da;i:jv(0G e \s),ki,d) = JJ Tf (da;i:jv(0, 1/(V)), 

after integrating out (xi:Ar(s), ,^(s)). Hence, the conditional joint distribution 
7f^(da;i,jv(s),d^(s)|d|/,^(\s), e \s),ki,d) = bf (da:i:Ar(s), d^(s)|a;(s),a;(\s)), 

which is consistent with part (i) of Algorithm [HI Similarly, 7f^(d|/, d^(s), da:i:Ar(s), fc*) = 
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N ^7i{dy)xT^{dxi.,N{,s),d^{s)\y{s),ks,y{\s)), so 


7r^{dy{\s),d^{s),dxi.,Nis),ks) = I N V(d|/(s), d|/(\s)) xTf (da;i:Ar(s), d^(s)|a;(s),l/(\s)) 


Ni 


= N V\,(d 2 /(\s))q,(da;i:jv(s)|^(s),d?/(\s)) 
Vs{d^{s)\y{s),y{\s))7rs{dy{s)\y{\s)) 


X 


g.,fc.(da;fc,(s)|^(s),2/(\s)) 


T(?/(s),da;fc,(s);^(s),?/(\s)) 


= N \\s{dyi\s))qs{dxi..N{s)\^{s),yi\s)) 
Vs{d^{s)\xkM,y{\s))T^s{xks{s)\A\s)) 

(ls,kAxkM\^{s),x{\s)) 

T^\s{dy{\s)) 


oc 


N 


qs(xi: 7 v(s)|^(s), a;(\s)) w,,fc^(a;fc^(s), ^(s), a;(\s)). 


Hence, Pr(i^^ = ks\y{\s),^{s),Xi.,N{s)) = Ws,ks(,Xi:N{,s),^{s),y{\s)), which is consistent 
with Step (ii) of Algorithm |8] Following the same arguments as in the proof of Theorem 
121 we can check that T^(d|/(s)|r/(\s), ^(s), a;i, 7 v(s), fcj = T(a:fc^(s), dr/(s), ^(s), x(\s)). Fi¬ 
nally, one can verify that the algorithm targets vr by hrst integrating out {xi:N{i),^{i)), 
i = 1,... ,d, and then summing over ki,... ,kd- □ 


A.3 Convergence of MIIS 

Before proving Theorem [21 we obtain a preliminary lemma. 
Lemma 1. Suppose Assumption IJ\ holds. Then, 

(^) 


P{y,l,dzx{k}) > y^^j^hi^k{,y,z). 


(ii) Recursively define Hi^k{y,z) = hi^k{y,z) and 


N 


rri+l 

^l,k 


{y,z) := E^-.^[Hlj{y,V)hj4V,z)] = 


Hlj{y,v)hj^k\ 


V, z 


\Ti{dv). 


i=i 


Then, 


Pfiy,l]dzx{k}) > 



(36) 


(in) Hf i^{y, z) > 0 for t >2 for all y,z ^ A. 

Proof. We first obtain Part (i). Assumption [H Part (i) implies that H4(a;i:v;0 — 
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Wk{xk;^)/CN. Hence, for k ^ I, 


{CN)P{y,l;dzx{k})> 

= 7i{dz) 
= 7i{dz) 
= 7i{dz) 


-K{dxk)r]{^\xk) 


^iv+i qk{dxk\0 

[ v{^\^) 

an +1 qk{dxk\0 


T{xk, dz; Or^(da:i:Ar, d^\y, 1) 
T{z, dxk, 0r^(dxi,7v, d^l?/, /) 


f v{^\z)v{d^\y) 

ly^N+i qk{dxk\Oqi{dxi\^) 

f qi^k{dxi,dxk\0 


T{z, dxk] OT{y, dxi] 0q(da;i:iv|0 


/^3 qkidxk\0qi{dxi\0 


T{z, dxk] OTiy, dxi; Ovi^\z)vid^\y) 


= 7r(dz) / v{^\z)y{d^\y)x 


r qi^k{dxi,dxk\0 
Ja^ qk{dxk\Oqi{dxi\^) 


T{z,dxk;OT{y, dxi;^) 


= 7r(dz) / y{^\z)y{^\y)Si^k{^,y,z)y{d^) 

Ja 

> 'K{dz)hi^k{y,z). 

We can similarly result for k = 1. We now prove part (ii). By part (i), Eq. fl36p holds for 
f = 1. Suppose that fl5^ holds for some t. Then, 

N 

P^+\y,l;dzx{k}) = Y^ / P\y,k,dvx{j})P{v,j;dzx{k}) 


j=i 

f 7i{dv) , ^(d^), 


5 ) AhjAv, A 


1 \ 7r(d2;) 


C} N 

t+i 


C 


7r(d2:) 
1^ 


^N-.^[Hlj{y,V)hj^k{V,z)] 


Hence, the bound holds for all t. We now prove Part (hi). We hrst show that Hf i^{y, z) > 0 
for t = 2 and then, recursively, for all t >2. For any pair y, z ^ A, and /, fc G {l:iV}, 

P^kiy^^) = '^N-^TT[hi,j{y,V)hj^k{V, z)] >'EN-i^[hi^j{y,V)hj^k{V,z)I{J G jTinjTfc)] >0, 

where the last inequality follows from Assumption [1] Part (ii). If Hfj{y, •) > 0, then 

H^;,\y,z) = E^-i^[Hlj{y,V)hj,k{V,z)] > E^-^^[Hlj{y,V)hj,k{V, z)I{J G Jk)] > 0. 
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□ 

Proof of Theorem\^ The sequence from the MIIS algorithm is Markov, be¬ 

cause the MIIS algorithm is a two component Gibbs sampler, and has transition kernel 


PiyJ;Bx{k}) = / Wk{xi..N,OTixk,B;0^^^dx,..NAf\y,l). 
Ja^+^ 


(37) 


The proof shows that for all starting values {y,l) G 74x{l:iV}, the step Markov 
transition kernel P^{y, 1] B x {k}) is positive for all t > 2, and any S G G such that 
7r(i?) > 0 and k G {1:77}. 

Suppose that y E B E Vt and k,l G {1:77}. If vr(i?) = 0 then P^{y,l; Bx{k}) = 0 
for t > 1; if 7i{B) > 0 then P*{y,l; Bx{k}) > 0 for all f > 2 by Lemma [H This 
means that the marginal chain is 77“^7r-irreducible and aperiodic and that P{y, /; dzx {fc}) 
is absolutely contin uous wit h resp ect to 77“^7r(d^). It then follows from Theorem 1 


and Corollary 1 in iTiernevI (1994) that for all {y,l) G Ax{l:77}, \imt^oo\P^{y,k, — 
iV-V(-)| TV = 0, proving the hrst part of the theorem. Proof of second part. Dehne 
gi{z) := > 0. Then, 

Hlkhj^z) = ^ f Tr{dz')hi,k'{y,z')hk',k{z\z) > ( / '^{dz')gi{z')^ ^ hk',kiz)- 


k'GJi 

Let L>i := / Tr{dz')gi{z'), 


k'GJi 


L >2 := f h^,^kiz)7i{dz) and u{B) := j h^, fc(z)7r(dz). 

k'&Ji k'eJi 


Then, from 


P\y,l;dz,{k}) > C-^DiD 2 N-^u{dz). 


and uniform ergodicity follows from Proposition 2 in iTiernevI fjl994l) . 


□ 


A.4 Convergence of the MIIS Gibbs Sampler 

We again consider the marginal chain {yt,lt,t > 0} of the MIIS sampler, where h : = 
{h-.d)t- Let Ps,M{y{s), Is] dz{s)x{ks}\y{\s)) be the transition kernel for the s*^ component 
of the marginal chain. The transition kernel for the marginal chain is 

d 

PMiy,l]dzx{k}) = JJP^,M( 2 /(s),G;d 2 ;(s)x{A;J| 2 ;(< s),y{> s)), 

S=1 
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where we use the shorthand notation z(< s) = 1) and y{> s) = y{s+l:d). Dehne 


hi,]^{y,z) := s),y{> s)). (38) 

We require the definition of the sub-stochastic kernel Hi k{y, dz) = C~^N~‘^hi^k{y, z)PG{y, dz) 
and, iteratively, 

Y1 [ dn) hj,k(n, z)PGiv, dz) 

= ^ / ^ij(l/w)^j,k(^,d2;)PG(2/,dn) 

= ^PG(y,-)/N‘i [hi,jiy, V) dz)] . 

Lemma 2. Suppose Assumption [H holds. Then, 

(i) The marginal chain {y^t), > 0} is Markov. 

(a) For t = 1, 2,... 

^M(l/,l;d^x{k}) > Hl^{y,dz). 

(Hi) Suppose t > 2, B E Q, and 7r{y) > 0. If PQ{y,B) > 0 then H\^{y,B) > 0. 

Proof. Part (i) follows from the construction of the MIIS sampler. 

We show part (ii) by induction. By part (i) of Lemma [H for each s = 1,..., d, 

Ps,M{y{s)^s]ks,dz{s)\y{\s)) > C~^/'^N-^hs,i,,ks{y{s),z{s)]y{\s))Tis{dz{s)\y{\s)). 

Hence, for f = 1, part (ii) follows form the definition of PuidjA] zx{\<i\) and H\y{ii,dz). 
Suppose Pf[{y,\]dvx{]}) > H(^{y,dv), for dn G 12 and j G Then 

^M^( 2 /)l;dzx{k}) = f Plj{y,\;dvx{j})PM{v,y,dzx{k}) 

>C-^N-^ j Y h^.{v,z)Hl.{y,dv)PG{v,dz) 

= Hl^\y,dz). 

Then part (ii) also holds for proving the result. 
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Part (iii) follows By induction. We first show that the result holds for t = 2 and, then, 
we show that if the result holds for some f > 2, it also holds for f+1. Let J\ = 
and verify that, under assumption [2]^ii) part b, jTifljLk 7 ^ 0, for any pair 1, k G {IW}. 

Suppose t = 2. If Pq{ii,B) > 0, there is a set F' ^ Q such that PG{y,F') > 0 and 
Pg{x,B) > 0 for X G F'. Let F' ^ F = FiX- ■ - xF^. For x G F (i.e., each i;(s) G Fg), 
s = and j in Js,f^Js,k, 

hs 7 j( 2 /(s), t;(s); t;(< s), y{> s))/i,j,fe(x(s), ^(s); z{< s),v{> s)) > 0 , 
from Assumption [2]J^ii) part (c). Therefore, 

d 

s), 2 /(> s)) 

je{i:W} S=1 

><hg^j^kXv{s), z{s)-, z{< s),v{> s)) > 0 


for r; G F, and any 1, k G Hence 


> 


C2 

1 


Hlk{y,z) = — / Y hi,j{y,v)hj^i,{v,z)PG{y,dv)PG{v,dz) 






Y z)PG{y, dv)PG{v, dz) ^ > 0 , 


where the last line follows from calculating each integral between brackets over Fg, s = 

1 , • • • ,d. 

Suppose that part (iii) holds for some t >2 and that P^^{y, B) > 0. Then, 


H\%\y.B) = 


CN<i 


E 

jefuTV} 



B J A 


H\j{y, d'y)hj,k(w, z)Pg{v, dz) 


> 


cm 



B J F 


YHli^yAv)hiAv,z) 

Ljeyk 


PG{v,dz) > 0 , 


where F G H is such that Pg{,x, F) > 0 for x G F and P^iy, F) > 0. The result holds for 
any 1 and k in □ 


Proof of TheoremlBi The result follows from Lemma [2] and Theorem 1 in lTiernevi fjl994|l . 
First dehne the Markov kernel [N~'^PG]{y,l] Bx{k}), that is the kernel of the Gibbs 
sampler that draws {z{s), kg)\{z{< s),y{> s),k<s,l>s) from A^“^ 7 rs(^(s)| 2 ;(< s),y{> s)), 
sequentially. If the Gibbs kernel Fq is irreducible and aperiodic, so it is the kernel 
[N- ■^Fg], since all kg G {IW}, s = 1,..., d, are accessible at each iteration. The proof 
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consists in showing that accessible sets from [N~'^PgY, the ideal Gibbs in t > 2 steps, 
are also accessible by the MIIS-Gibbs kernel after t iterations, Lemma [2] (i) shows 
that Pm is a Markov kernel. Parts (ii) and (hi) together show that PG{y,B) > 0 implies 
that Pm(|/, k; Px {1}) > 0 for any pair (1, k). Hence, all sets accessible by [N~‘^Pg] are 
also accessible by Pm, which implies that Pm is also irredncible. To show that Pm is 
aperiodic, we assnme by contradiction that Pm is not aperiodic. In this case, [A^’ -‘^Pg] 


wonld have to be periodic as well, which contradic 
is aperiodic. The result follows from Theorem 1 of 


;s with 


he ass umption that [N '^Pgj 
T iernevI (1994). □ 


It also follows from Theorem [H] that, hm 4 _ 5 .oo Pm( 2/(-5), •|?/(\s)) = N ^7rs(-||/(\s), 

which implies that the control variates in Section 15.31 can be safely used. 


Proof of Corollary^ We can check that the conditions of Assumption 2 hold in a similar 
way to the proof of Gorollary [T] The result follows from Theorem [6l □ 


A. 5 Proofs of consistency 


Proof of Corollary\^ The distribution of converges to A^“^7r(-) by Theorem O 

Let Ecistif) ke defined by Equation fl22|) in Section [5Tl The result now follows from 
Lemma [H which shows that e ach A f) is unbiased and by the strong law of large 
numbers for ergodic sequences flTiernevl . 11994 Theorem 3). □ 

Proof of Corollaryl^ The distribution of converges to by Theorem E] 

The resu lt follow s from Lemma [3] and the strong law of large numbers for ergodic se- 

□ 


quences (iTiernevl . ll994J. Theorem 3). 


Proof of Corollary^ For any / with P^(|/|) < cxo, it follows from Gorollary [5] that 
EmcU) ^(/)) E^’YisU) with probability one. This means that Pmc(/)“ 

pMHsif) k’ with probability one. Hence, for any constant n G and 7r-integrable 
functions Qi-, ■ ■ ■ ,gp, the linear combination ^ with prob¬ 
ability one. The proof now follows from Gorollary [5] □ 


Proof of Corollary The proof of this corollary follows the same arguments used in the 
proof of Gorollary [71 □ 
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